In this tutorial we will see the necessary steps to run benchmarks on rdflib stores. It is a follow-up to previous work on getting and analysing the results of some benchmarks.
This document will cover:
- Loading data into an rdflib store
- Running the benchmarks on the rdflib store
In [3]:
import rdflib
Here we choose to use Sleepycat as the rdflib store, so we declare the store like this:
In [4]:
graph = rdflib.Graph(store='Sleepycat',
                     identifier='my_benchmark')
Now we need to actually open the store:
In [5]:
graph.open('my_sleepycat_store',
           create=True)  # set to False if you have already created the store
We can check where we are (to know where the store will end up) with:
In [6]:
!pwd
The data we will put in the store is generated data from SP2Bench. Archives of this data are in the data directory. They are named after the number of triples they contain.
We will do this tutorial with the graph in data/32000.n3.
If you haven't extracted the archives, you can do so in Python:
In [7]:
from bz2 import BZ2File
# BZ2File gives a file-like object that graph.parse() can read directly
data = BZ2File('../data/32000.n3.bz2')
Otherwise, just declare the path to the n3 file by uncommenting the following:
In [8]:
# data = '../data/32000.n3'
We put this graph into our store with (the %time prefix is optional):
In [9]:
%time graph.parse(data, format='n3')
We can check that our graph now contains around 32k triples:
In [10]:
print("Size of {0} graph is {1} triples".format(graph.identifier, len(graph)))
We are going to use BenchManager to do the benchmarks.
We will measure some of the SPARQL queries defined in bench_examples/queries.py. So we first load the queries:
In [11]:
# QUERIES maps query names to SPARQL query strings (defined in bench_examples/queries.py)
from queries import QUERIES
Now we set up the benchmark with the help of BenchManager:
In [12]:
from ktbs_bench_manager import BenchManager
bmgr = BenchManager()
We make our Sleepycat store a context for BenchManager. This is simply a function decorated by @bmgr.context that must yield an rdflib graph:
In [13]:
@bmgr.context
def sleepycat():
    yield graph
In this case it is quite simple because we already created the graph object previously. In more complex cases you may need to open, check, and close the graph inside the context (see here for an example).
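As a rough sketch, such a context could look like the following (it reuses our Sleepycat store from above; the function name sleepycat_managed is just for illustration):
In [ ]:
@bmgr.context
def sleepycat_managed():
    # open the graph inside the context...
    g = rdflib.Graph(store='Sleepycat', identifier='my_benchmark')
    g.open('my_sleepycat_store', create=False)  # create=False: fail early if the store is missing
    try:
        yield g
    finally:
        g.close()  # ...and always close it, even if a bench function raises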
Now we need to set up the bench functions for our BenchManager by decorating them with @bmgr.bench:
In [14]:
@bmgr.bench
def qall(some_graph):
    some_graph.query(QUERIES['query_all'])

@bmgr.bench
def q1(some_graph):
    some_graph.query(QUERIES['q1'])
We could go on and bench all the queries in QUERIES, but that is not the purpose of this tutorial.
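That said, since @bmgr.bench is just a decorator, we could register one bench function per query by calling it as a plain function. A minimal sketch (make_bench is a hypothetical helper; we also assume BenchManager labels results by each function's __name__):
In [ ]:
def make_bench(query_name):
    def bench(some_graph):
        some_graph.query(QUERIES[query_name])
    bench.__name__ = 'q_' + query_name  # assumption: results are labeled by __name__
    return bench

for query_name in QUERIES:
    bmgr.bench(make_bench(query_name))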
We have completed the setup of BenchManager, so we can now run it and output the results to a file:
In [15]:
bmgr.run('/tmp/my_bench.csv')
In the CSV file, the columns are the contexts (in this case the only one we have set up, Sleepycat, but we can declare as many as we want) and the lines are the bench functions (in this case q1 and qall). The intersection of a line and a column is the time result (in seconds) for one bench function run against one bench context.
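Schematically, the file should look something like this (an illustrative layout with placeholder values, not actual output; the exact header labels may differ):
,sleepycat
q1,<seconds>
qall,<seconds>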
The results for our little bench are:
In [16]:
!cat /tmp/my_bench.csv
Don't forget to close the graph ;)
In [17]:
graph.close()
If you plan to do more advanced benchmarks on rdflib stores you should consider:
- using BenchableGraph from ktbs_bench_manager to have a consistent interface between different stores (a rough sketch follows below);
- using the bench.py utility to run several defined benchmarks.
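To give a flavor of the first suggestion, here is a rough sketch; the constructor arguments and method names below are assumptions for illustration, so check the ktbs_bench_manager sources for the actual BenchableGraph API:
In [ ]:
from ktbs_bench_manager import BenchableGraph

# Hypothetical usage: wrap a store behind a uniform connect/close interface
bg = BenchableGraph(store='Sleepycat',
                    graph_id='my_benchmark',
                    store_config='my_sleepycat_store',
                    graph_create=False)
bg.connect()
print(len(bg.graph))
bg.close()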